(54) Software scheduled superscalar computer architecture



(57) A computing system is described in which 
groups of individual instructions are executable in par- 
allel by processing pipelines, and instructions to be ex- 
ecuted in parallel by different pipelines are supplied to 
the pipelines simultaneously. During compilation of the 
instructions those which can be executed in parallel are 
identified. The system includes a register for storing an 
arbitrary number of the instructions to be executed. The
instructions to be executed are tagged with pipeline
identification tags and group identification tags indica- 
tive of the pipeline to which they should be dispatched, 
and the group of instructions which may be dispatched 
during the same operation. The pipeline and group iden- 
tification tags are used to dispatch the appropriate 
groups of instructions simultaneously to the differing 
pipelines.
Description

[0001] This invention relates to a processor according
to claim 1, a method according to claim 11, and a memory
according to claim 21, i.e. to the architecture of computing
systems, and in particular to an architecture in which
groups of instructions may be executed in parallel, as
well as to methods and apparatus for accomplishing
that.

[0002] A common goal in the design of computer ar- 
chitectures is to increase the speed of execution of a 
given set of instructions. Many solutions have been pro- 
posed for this problem, and these solutions generally 
can be divided into two groups. 
[0003] According to a first approach, the speed of execution
of individual instructions is increased by using
techniques directed to decreasing the time required to
execute a group of instructions serially. Such techniques
include employing simple fixed-width instructions, pipelined
execution units, separate instruction and data
caches, increasing the clock rate of the instruction processor,
employing a reduced set of instructions, using
branch prediction techniques, and the like. As a result it
is now possible to reduce the number of clocks to execute
an instruction to approximately one. Thus, in these
approaches, the instruction execution rate is limited to
the clock speed for the system.
[0004] To push the limits of instruction execution to
higher levels, a second approach is to issue more than
one instruction per clock cycle, in other words, to issue
instructions in parallel. This allows the instruction execution
rate to exceed the clock rate. There are two classical
approaches to parallel execution of instructions.
[0005] Computing systems that fetch and examine
several instructions simultaneously to find parallelism in
existing instruction streams to determine if any can be
issued together are known as superscalar computing
systems. In a conventional superscalar system, a small
number of independent instructions are issued in each
clock cycle. Techniques are provided, however, to prevent
more than one instruction from issuing if the instructions
fetched are dependent upon each other or do not
meet other special criteria. There is a high hardware
overhead associated with this hardware instruction
scheduling process. Typical superscalar machines include
the Intel i960CA, the IBM RIOS, the Intergraph
Clipper C400, the Motorola 88110, the Sun SuperSparc,
the Hewlett-Packard PA-RISC 7100, the DEC Alpha,
and the Intel Pentium.

[0006] Many researchers have proposed techniques
for superscalar multiple instruction issue. Agerwala, T.,
and J. Cocke [1987] "High Performance Reduced Instruction
Set Processors," IBM Tech. Rep. (March), proposed
this approach and coined the name "superscalar."
IBM described a computing system based on these
ideas, and now manufactures and sells that machine as
the RS/6000 system. This system is capable of issuing
up to four instructions per clock and is described in "The
IBM RISC System/6000 Processor," IBM J. of Res. &
Develop. (January, 1990) 34:1.
[0007] The other classical approach to parallel instruction
execution is to employ a "wide-word" or "very
long instruction word" (VLIW) architecture. A VLIW machine
requires a new instruction set architecture with a
wide-word format. A VLIW format instruction is a long
fixed-width instruction that encodes multiple concurrent
operations. VLIW systems use multiple independent
functional units. Instead of issuing multiple independent
instructions to the units, a VLIW system combines the
multiple operations into one very long instruction. For
example, in a VLIW system, multiple integer operations,
floating point operations, and memory references may
be combined in a single "instruction." Each VLIW instruction
thus includes a set of fields, each of which is
interpreted and supplied to an appropriate functional
unit. Although the wide-word instructions are fetched
and executed sequentially, because each word controls
the entire breadth of the parallel execution hardware,
highly parallel operation results. Wide-word machines
have the advantage of scheduling parallel operation
statically, when the instructions are compiled. The fixed
width instruction word and its parallel hardware, however,
are designed to fit the maximum parallelism that
might be available in the code, and most of the time far
less parallelism is available in the code. Thus for much
of the execution time, most of the instruction bandwidth
and the instruction memory are unused.

[0008] There is often a very limited amount of parallelism
available in a randomly chosen sequence of instructions,
especially if the functional units are pipelined.
When the units are pipelined, operations being issued
on a given clock cycle cannot depend upon the
outcome of any of the previously issued operations already
in the pipeline. Thus, to efficiently employ VLIW,
many more parallel operations are required than the
number of functional units.

[0009] Another disadvantage of VLIW architectures,
which results from the fixed number of slots in the very
long instruction word for classes of instructions, is that
a typical VLIW instruction will contain information in only
a few of its fields. This is inefficient, requiring the system
to be designed for a circumstance that occurs only rarely:
a fully populated instruction word.

[0010] Another disadvantage of VLIW systems is the
need to increase the amount of code. Whenever an instruction
is not full, the unused functional units translate
to wasted bits, no-ops, in the instruction coding. Thus
useful memory and/or instruction cache space is filled
with useless no-op instructions. In short, VLIW machines
tend to be wasteful of memory space and memory
bandwidth except for a very limited class of programs.
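
The no-op overhead described above can be illustrated with a small Python sketch. The schedule, the slot count of 8, and the function names are all hypothetical, chosen only to show how unused slots in a fixed-width word turn into stored no-ops; they are not figures from any real machine.

```python
# Illustrative only: the schedule and the 8-slot word width are assumptions.

def vliw_noop_fraction(schedule, slots_per_word):
    """Fraction of instruction slots that hold no-ops in a fixed-width encoding.

    `schedule` is a list of cycles; each cycle lists the operations issued
    together. Every VLIW word occupies `slots_per_word` slots, so each
    unused slot must be encoded as a no-op.
    """
    for cycle in schedule:
        if len(cycle) > slots_per_word:
            raise ValueError("cycle issues more operations than the word has slots")
    total_slots = len(schedule) * slots_per_word
    used_slots = sum(len(cycle) for cycle in schedule)
    return (total_slots - used_slots) / total_slots


def tagged_slots(schedule):
    """Slots needed when each operation is stored once, without padding."""
    return sum(len(cycle) for cycle in schedule)


schedule = [["load", "add"], ["fmul"], ["store", "add", "branch"], ["add"]]
print(vliw_noop_fraction(schedule, slots_per_word=8))
print(tagged_slots(schedule))
```

With this made-up schedule, seven useful operations occupy four full words in the fixed-width encoding, so most of the stored slots are no-ops.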

[0011] The term VLIW was coined by J. A. Fisher and
his colleagues in Fisher, J. A., J. R. Ellis, J. C. Ruttenberg,
and A. Nicolau [1984], "Parallel Processing: A
Smart Compiler and a Dumb Machine," Proc. SIGPLAN
Conf. on Compiler Construction (June), Palo Alto, CA,
11-16. Such a machine was commercialized by Multiflow
Corporation.

[0012] For a more detailed description of both superscalar
and VLIW architectures, see Computer Architecture:
A Quantitative Approach, John L. Hennessy and
David A. Patterson, Morgan Kaufmann Publishers,
1990.

[0013] The invention is defined in claims 1, 11 and 21.
[0014] We have developed a computing system architecture,
which we term software-scheduled superscalar,
which enables instructions to be executed both sequentially
and in parallel, yet without wasting space in the
instruction cache or registers. Like a wide-word machine,
we provide for static scheduling of concurrent operations
at program compilation. Instructions are also
stored and loaded into fixed width frames (equal to the
width of a cache line). Like a superscalar machine, however,
we employ a traditional instruction set, in which
each instruction encodes only one basic operation
(load, store, etc.). We achieve concurrence by fetching
and dispatching "groups" of simple individual instructions,
arranged in any order. The architecture of our invention
relies upon the compiler to assign instruction sequence
codes to individual instructions at the time they
are compiled. During execution these instruction sequence
codes are used to sort the instructions into appropriate
groups and execute them in the desired order.
Thus our architecture does not suffer the high hardware
overhead and runtime constraints of the superscalar
strategy, nor does it suffer the wasted instruction bandwidth
and memory typical of VLIW systems.
[0015] Our system includes a mechanism, an associative
crossbar, which routes in parallel each instruction
in an arbitrarily selected group to an appropriate pipeline,
as determined by a pipeline tag applied to that instruction
during compilation. Preferably, the pipeline tag
will correspond to the type of functional unit required for
execution of that instruction, e.g., floating point unit 1.
All instructions in a selected group can be dispatched
simultaneously.
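
As a rough software model of this routing step, the Python sketch below places each instruction of one group on the output selected by its pipeline tag. The `Tagged` type, the `route` function, the example instructions, and the choice of eight pipelines are assumptions made for illustration; the hardware crossbar performs the same selection for all instructions at once rather than in a loop.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    op: str        # an ordinary single-operation instruction
    pipeline: int  # pipeline tag applied at compile time (hypothetical encoding)


def route(group, num_pipelines=8):
    """Return a list indexed by pipeline number holding the routed instruction.

    Idle pipelines hold None. The position of an instruction inside the
    group does not matter; only its pipeline tag selects the output.
    """
    outputs = [None] * num_pipelines
    for instr in group:
        if not 0 <= instr.pipeline < num_pipelines:
            raise ValueError(f"no such pipeline: {instr.pipeline}")
        if outputs[instr.pipeline] is not None:
            raise ValueError(f"pipeline {instr.pipeline} tagged twice in one group")
        outputs[instr.pipeline] = instr.op
    return outputs


group = [
    Tagged("fadd f1,f2,f3", 5),
    Tagged("load r1,(r2)", 0),
    Tagged("add r3,r4,r5", 2),
]
print(route(group))
```

The check for a pipeline tagged twice reflects the assumption that a well-formed group never sends two instructions to the same pipeline in one dispatch.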

[0016] Thus, in one implementation, our system includes
a cache line, register, or other means for holding
at least one group of instructions to be executed in parallel,
each instruction in the group having associated
therewith a pipeline identifier indicative of the pipeline
for executing that instruction and a group identifier indicative
of the group of instructions to be executed in
parallel. The group identifier causes all instructions having
the same group identifier to be executed simultaneously,
while the pipeline identifier causes individual instructions
in the group to be supplied to an appropriate
pipeline.
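
One way to picture how the two identifiers work together is as a matrix of switch enables, one row per stored instruction and one column per pipeline. The Python sketch below is a minimal model under that assumption; `switch_matrix` and the tuple layout `(group_id, pipeline_id)` are hypothetical names, not taken from the specification.

```python
def switch_matrix(frame, next_group, num_pipelines):
    """Compute which crossbar switches close for the next dispatch.

    `frame` is a list of (group_id, pipeline_id) pairs, one per stored
    instruction. Entry [i][p] is True only when instruction i belongs to
    the group being issued AND its pipeline identifier names pipeline p.
    """
    matrix = []
    for group_id, pipeline_id in frame:
        selected = group_id == next_group  # group identifier comparison
        row = [selected and pipeline_id == p for p in range(num_pipelines)]
        matrix.append(row)
    return matrix


frame = [(0, 1), (0, 3), (1, 1), (1, 0)]
for row in switch_matrix(frame, next_group=0, num_pipelines=4):
    print(row)
```

Rows for instructions outside the selected group are entirely False, so those instructions stay in storage until their own group identifier is selected.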

[0017] In another embodiment the register holds multiple
groups of instructions, and all of the instructions in
each group having a common group identifier are placed
next to each other, with the group of instructions to be
executed first placed at one end of the register, and the
instructions in the group to be executed last placed at
the other end of the register.
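
The adjacency rule can be sketched in Python as follows. The functions `pack_frame` and `issue_order`, the use of a running index as the group identifier, and the frame width of 8 are assumptions for illustration only; the specification does not fix an encoding here.

```python
from itertools import groupby


def pack_frame(groups, frame_width):
    """Lay out groups side by side, first-to-execute group at one end.

    `groups` lists the groups in execution order; each group is a list of
    instructions. Each slot is a (group_id, instruction) pair. The group id
    is simply the group's position, which is an assumed encoding.
    """
    slots = [(gid, op) for gid, group in enumerate(groups) for op in group]
    if len(slots) > frame_width:
        raise ValueError("groups do not fit in one frame")
    return slots


def issue_order(frame):
    """Recover the groups in dispatch order from a packed frame.

    groupby only merges neighbouring slots, so this relies on members of a
    group being stored next to each other.
    """
    return [[op for _, op in members]
            for _, members in groupby(frame, key=lambda slot: slot[0])]


frame = pack_frame([["load", "fadd"], ["add"], ["store", "sub", "branch"]], 8)
print(frame)
print(issue_order(frame))
```

Because members of a group are contiguous, finding the next group only requires scanning from one end of the frame until the group identifier changes.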

[0018] In another embodiment of our invention a 
method of executing arbitrary numbers of instructions in 
a stream of instructions in parallel includes the steps of 
compiling the instructions to determine which instruc- 
tions can be executed simultaneously, assigning group 
identifiers to sets of instructions that can be executed in 
parallel, determining a pipeline for execution of each in- 
struction, assigning a pipeline identifier to each instruc- 
tion, and placing the instructions in a cache line or reg- 
ister for execution by the pipelines. 
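
A minimal Python sketch of these compile-time steps is given below. It is not the compiler of the invention: it uses a simple greedy, in-order rule (start a new group when an instruction needs a register written earlier in the current group, or when its pipeline is already taken), and it assumes the pipeline for each instruction is supplied as input. The `Instr` type and `assign_tags` function are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read
    unit: int               # pipeline chosen for this instruction (assumed given)


def assign_tags(instrs):
    """Return (group_id, pipeline_id, op) triples in program order.

    Simplification: only read-after-write and write-after-write conflicts
    inside a group, plus pipeline conflicts, force a new group.
    """
    tagged = []
    group_id = 0
    written = set()  # registers written by the current group
    busy = set()     # pipelines already used by the current group
    for ins in instrs:
        depends = any(src in written for src in ins.srcs) or ins.dest in written
        if depends or ins.unit in busy:
            group_id += 1
            written.clear()
            busy.clear()
        tagged.append((group_id, ins.unit, ins.op))
        if ins.dest is not None:
            written.add(ins.dest)
        busy.add(ins.unit)
    return tagged


program = [
    Instr("load r1,(r2)", "r1", ("r2",), 0),
    Instr("fadd f1,f2,f3", "f1", ("f2", "f3"), 5),
    Instr("add r3,r1,r4", "r3", ("r1", "r4"), 2),
    Instr("add r5,r6,r7", "r5", ("r6", "r7"), 2),
]
for triple in assign_tags(program):
    print(triple)
```

A real compiler would be free to reorder independent instructions to build larger groups; the in-order rule here is only the simplest scheme that yields valid tags.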
[0019] Further embodiments of the invention are:

1. A computing system for executing groups of individual
instructions in parallel by processing pipelines,
the system comprising:

storage means for holding at least one group of
instructions to be executed in parallel, each instruction
in the group having associated therewith
a pipeline identifier indicative of the pipeline
for executing that instruction and a group
identifier indicative of the group of instructions
to be executed in parallel;
means responsive to the group identifier for 
causing all instructions having the same group 
identifier to be executed at the same time; and 
means responsive to the pipeline identifier of 
the individual instructions in the group to supply 
each instruction in the group to be executed in 
parallel to an appropriate pipeline. 

2. A computing system as in embodiment 1 wherein 
the storage means includes the at least one group 
of instructions, and for each instruction the storage 
means includes the group identifier and the pipeline 
identifier. 

3. A computing system as in embodiment 2 wherein 
each instruction in the at least one group of instruc- 
tions has associated therewith a different pipeline 
identifier. 

4. A computing system as in embodiment 1 wherein 
the storage means holds at least two groups of in- 
structions, and all of the instructions in each group 
having associated therewith a common group iden- 
tifier are placed adjacent to each other in the stor- 
age means. 

5. A computing system as in embodiment 4 wherein:

the storage means comprises a line in a cache
memory having a fixed number of storage locations;

the group of instructions to be executed first is
placed at one end of the line in the cache memory,
and the instructions in the group to be executed
last are placed at the other end of the line
in the cache memory.

6. A method of executing arbitrary numbers of instructions
in a stream of instructions in parallel
which have been compiled to determine which instructions
can be executed at the same time, the
method comprising:

in response to the compilation assigning group
identifiers to sets of instructions which can be
executed in parallel;

determining a pipeline for execution of each instruction
in a group to be executed;

assigning a pipeline identifier to each instruction
in the group; and

placing the instructions in a register for execution
by the pipelines.

7. A method as in embodiment 6 further comprising
the step of executing a group of instructions in parallel.

8. A method as in embodiment 7 wherein the register
holds at least two groups of instructions, and the
step of placing the instructions in the register for execution
by the pipelines comprises placing the instructions
in each group having associated therewith
a common group identifier adjacent to each
other in the register.

9. A method as in embodiment 8 wherein the step of executing
a group of instructions in parallel comprises coupling
the register to detection means to receive the
group identifier of each instruction in the register
and the group identifier of the next group of instructions
to be supplied to the pipelines; and supplying
only the instructions with the next group identifier to
the pipeline execution units.

10. In a computing system in which groups of individual
instructions are executable in parallel by
processing pipelines, a method for supplying each
instruction in a group to be executed in parallel to
an appropriate pipeline, the method comprising:

storing in storage an instruction frame, the
frame including at least one group of instructions
to be executed in parallel, each instruction
in the group having associated therewith a
pipeline identifier indicative of the pipeline
which will execute that instruction and a group
identifier indicative of the group identification;
comparing the group identifier of each instruction
in the instruction frame and a group identifier
of those instructions to be next executed in
parallel; and

using the pipeline identifier of those instructions
to be next executed in parallel to control an execution
unit to execute all of the instructions in
the group in separate pipelines.

11. In a computing system in which groups of individual
instructions are executable in parallel by
processing pipelines, apparatus for routing each instruction
in a group to be executed in parallel to an
appropriate pipeline, the apparatus comprising:

storage for holding at least one group of instruc- 
tions to be executed in parallel, each instruction 
in the group having associated therewith a 
pipeline identifier indicative of the pipeline for 
executing that instruction and a group identifier 
to designate among the instructions present in 
the storage those instructions which may be si- 
multaneously supplied to the processing pipe- 
lines; 

a crossbar having a first set of connectors cou- 
pled to the storage for receiving instructions 
therefrom and a second set of connectors cou- 
pled to the processing pipelines; 
means responsive to the pipeline identifier of 
the individual instructions in the group for rout- 
ing individual instructions onto appropriate 
ones of the second set of connectors, to there- 
by supply each instruction in the group to be 
executed in parallel to the appropriate pipeline. 

12. Apparatus as in embodiment 11 wherein: 

the first set of connectors consists of a set of 
first communication buses, one for each in- 
struction in the storage; 
the second set of connectors consists of a set 
of second communication buses, one for each 
pipeline; and the means responsive to the pipe- 
line identifier comprises: 

a set of decoders coupled to the storage to 
receive as first input signals the pipeline 
identifiers and in response thereto supply 
as output signals a switch control signal; 
and 

a set of switches, coupled to the decoders,
one switch at the intersection of each of the
first set of connectors with the second set
of connectors, the switches providing connections
in response to receiving the
switch control signal to thereby supply
each instruction in the group to be executed
in parallel to the appropriate pipeline.

13. Apparatus as in embodiment 12 further comprising:

detection means coupled to receive the group
identifier of each instruction in the storage and
connected to receive information regarding the
group identifier of the next group of instructions
to be supplied to the pipelines, and in response
thereto supply a group control signal; and
wherein the set of decoders coupled to the storage
are also coupled to the detection means to
receive the group control signal and in response
thereto supply a switch control signal
for only those instructions in the group to be
supplied to the pipelines.

14. Apparatus as in embodiment 13 wherein the detection
means comprises a multiplexer coupled to
receive each of the group identifiers of instructions
in the storage and compare them to the information
regarding the group identifier of the next group of
instructions to be supplied to the pipelines.

15. Apparatus as in embodiment 14 wherein the
multiplexer supplies an output signal to the decoders
to indicate the group identifier of the next group
of instructions to be supplied to the pipelines.

16. In a computing system in which groups of individual
instructions are executable in parallel by
processing pipelines, apparatus for routing each instruction
in a group to be executed in parallel to an
appropriate pipeline, the apparatus comprising:

a storage for holding an instruction frame, the
frame including at least one group of instructions
to be executed in parallel, each instruction
in the group having associated therewith a
pipeline identifier indicative of the pipeline to
which that instruction is to be issued and a
group identifier indicative of the group identification;

a crossbar switch having a first set of connectors
coupled to the storage for receiving instructions
therefrom and a second set of connectors
coupled to the processing pipelines;
selection means connected to receive the
group identification of each instruction in the instruction
frame and connected to receive information
about the group identifier of those instructions
to be next executed in parallel for
supplying in response thereto an output signal
indicative of the next set of instructions to be
executed in parallel; and
decoder means coupled to receive the output
signal and each of the pipeline identifiers of the
instructions in the storage for selectively connecting
ones of the first set of connectors to
ones of the second set of connectors to thereby
supply each instruction in the group to be executed
in parallel to the appropriate pipeline.

17. Apparatus as in embodiment 16 wherein the first
set of connectors consists of a set of first communication
buses, one for each instruction in the storage;

the second set of connectors consists of a set
of second communication buses, one for each
pipeline;

the decoder means comprises a set of decoders
coupled to receive as first input signals the
pipeline identifiers and as second input signals
information about the group identifier of the
next group of instructions to be executed by the
pipelines and in response thereto supply as
output signals a switch control signal; and
the crossbar switch includes a set of switches,
one at the intersection of each of the first set of
connectors with the second set of connectors,
the switches providing connections in response
to receiving the switch control signal to thereby
supply each instruction in the group to be executed
in parallel to the appropriate pipeline.

18. Apparatus as in embodiment 17 wherein the selection
means coupled to the storage comprises a
multiplexer coupled to receive each of the group
identifiers of instructions in the storage and compare
them to information regarding the group identifier
of the next group of instructions to be supplied
to the pipelines.

19. Apparatus as in embodiment 18 wherein the
multiplexer supplies an output signal to the decoders
to select the group identifier of the next group
of instructions to be supplied to the pipelines.

20. In a computing system in which groups of indi- 
vidual instructions are executable in parallel by 
processing pipelines, a method for transferring 
each instruction in a group to be executed through 
a crossbar switch having a first set of connectors 
coupled to the storage for receiving instructions 
therefrom and a second set of connectors coupled 
to the processing pipelines, the method comprising: 
storing in storage at least one group of instructions 
to be executed in parallel, each instruction in the 
group having associated therewith a pipeline iden- 
tifier indicative of the pipeline which will execute that 
instruction; and 

using the pipeline identifiers of the individual 
instructions in the at least one group of instructions 
which are to be executed next to control switches 
between the first set of connectors and the second 
set of connectors to thereby supply each instruction 
in the group to be executed in parallel to the appro- 
priate pipeline. 

21. A method as in embodiment 20 wherein the step
of using comprises:

supplying the pipeline identifiers of the individual
instructions in the at least one group of instructions
to a corresponding number of decoders,
each of which provides an output signal indicative
of the pipeline identifiers; and
using the decoder output signals to control the
switches between the first set of connectors
and the second set of connectors to thereby
supply each instruction in the group to be executed
in parallel to the appropriate pipeline.

22. A method as in embodiment 21 wherein each
of the instructions in the storage further includes a
group identifier to designate among the instructions
present in the storage those which may be simultaneously
supplied to the processing pipelines, and the method
further comprises:

supplying information about the group identifier
of the next group of instructions to be executed
by the pipelines together with the group identifiers
of the individual instructions in the at least
one group of instructions to a selector;
comparing the group identifier of the next group
of instructions to be executed by the pipelines
with the group identifiers of the individual instructions
in the at least one group of instructions,
to provide output comparison signals;
and

using both the output comparison signals and
the decoder output signals to control the switches
between the first set of connectors and the
second set of connectors to thereby supply
each instruction in the group to be executed in
parallel to the appropriate pipeline.

23. In a computing system in which groups of individual
instructions are executable in parallel by
processing pipelines, a method for supplying each
instruction in a group to be executed in parallel to
an appropriate pipeline, the method comprising:

storing in storage an instruction frame, the
frame including at least one group of instructions
to be executed in parallel, each instruction
in the group having associated therewith a
pipeline identifier indicative of the pipeline
which will execute that instruction and a group
identifier indicative of the group identification;
comparing the group identifier of each instruction
in the instruction frame and a group identifier
of those instructions to be next executed in
parallel; and
using the pipeline identifier of those instructions
to be next executed in parallel to control switches
in a crossbar switch having a first set of connectors
coupled to the storage for receiving instructions
therefrom and a second set of connectors
coupled to the processing pipelines to
thereby supply each instruction in the group to
be executed in parallel to the appropriate pipeline.

24. A computing system for executing groups of in- 
dividual instructions in parallel by processing pipe- 
lines, the system comprising: 



storage means for holding at least one group of 
instructions to be executed in parallel, each in- 
struction in the group having associated there- 
with a pipeline identifier indicative of the pipe- 
line for executing that instruction and a group 
identifier indicative of a group of instructions to 
be executed in parallel; 

means responsive to the group identifier for 
causing all instructions having the same group 
identifier to be executed at the same time; and 
means responsive to the pipeline identifier of 
the individual instructions in the group to supply 
each instruction in the group to be executed in 
parallel to an appropriate pipeline, 
wherein the storage means includes the at least 
one group of instructions, and for each instruc- 
tion the storage means includes the group iden- 
tifier and the pipeline identifier. 

25. The system of embodiment 24, wherein each 
instruction in the at least one group of instructions 
has associated therewith a different pipeline identi- 
fier. 

26. The system of embodiment 24 or 25, wherein 
the storage means holds at least two groups of in- 
structions, and all of the instructions in each group 
having associated therewith a common group iden- 
tifier are placed adjacent to each other in the stor- 
age means. 

27. The system of embodiment 26, wherein the stor- 
age means comprises a line in a cache memory 
having a fixed number of storage locations; the 
group of instructions to be executed first is placed 
at one end of the line in the cache memory, and the 
instructions in the group to be executed last is 
placed at the other end of the line in the cache mem- 
ory. 

28. The system of any one of the embodiments 24 
to 27, wherein a crossbar having a first set of con- 
nectors coupled to the storage means for receiving 
instructions therefrom and a second set of connec- 
tors coupled to the processing pipelines is provided. 

29. The system of embodiment 28, wherein the first 
and the second set of connectors, respectively, con- 
sist of a set of communication buses, one for each 
instruction in the storage means and one for each 
pipeline, respectively, and wherein a set of decod- 
ers coupled to the storage means to receive as first 
input signals the pipeline identifiers and in response 
thereto supply as output signals a switch control sig- 
nal and a set of switches, coupled to the decoders 
with one switch at the intersection of each of the first 
set of connectors, are provided, the switches pro- 
viding connections in response to receiving the 
switch control signal to thereby supply each instruc- 
tion in the group to be executed in parallel to the 
appropriate pipeline. 

30. The system of embodiment 29, further comprising
detection means coupled to receive the group
identifier of each instruction in the storage means
and connected to receive information regarding the
group identifier of the next group of instructions to
be supplied to the pipelines, and in response thereto
supply a group control signal, and wherein the
set of decoders coupled to the storage means are
also coupled to the detection means to receive the
group control signal and in response thereto supply
a switch control signal for only those instructions
in the group to be supplied to the pipelines.

31. The system of embodiment 30, wherein the detection
means comprises a multiplexer coupled to
receive each of the group identifiers of instructions
in the storage means and compare them to
the information regarding the group identifier of the
next group of instructions to be supplied to the pipelines.

32. The system of embodiment 31, wherein the multiplexer
supplies an output signal to the decoders to
indicate the group identifier of the next group of instructions
to be supplied to the pipelines.

33. The system of any one of the embodiments 24
to 32, wherein the group of instructions is an instruction
frame.

34. The system of embodiment 33, further comprising
selection means connected to receive the group
identifier of each instruction in the instruction frame
and connected to receive information about the
group identifier of those instructions to be next executed
in parallel for supplying in response thereto
an output signal indicative of the next set of instructions
to be executed in parallel, and decoder means
coupled to receive the output signal and each of the
pipeline identifiers of the instructions in the storage
for selectively connecting ones of the first set of connectors
to thereby supply each instruction in the
group to be executed in parallel at the appropriate
pipeline.

35. A method for executing groups of individual instructions
in parallel by processing pipelines, the
method comprising:

storing in storage means for holding at least
one group of instructions to be executed in parallel,
each instruction in the group having associated
therewith a pipeline identifier indicative
of the pipeline for executing that instruction and
a group identifier indicative of a group of instructions
to be executed in parallel;
comparing the group identifier of each instruction
in the group of instructions and a group
identifier of those instructions to be next executed
in parallel; and

using the pipeline identifier of those instructions
to be next executed in parallel to control an execution
unit to execute all of the instructions in
the group in separate pipelines.



36. The method of embodiment 35, further comprising
transferring each instruction in a group to be executed
through a crossbar having a first set of connectors
coupled to the storage means for receiving
instructions therefrom and a second set of connectors
coupled to the processing pipelines and using
the pipeline identifiers of the individual instructions
in the at least one group of instructions which are
to be executed next to control switches between the
first set of connectors and the second set of connectors
to thereby supply each instruction in the
group to be executed in parallel to the appropriate
pipeline.

37. The method of embodiment 36, wherein the step 
of using comprises supplying the pipeline identifiers 
of the individual instructions in the at least one 
group of instructions to a corresponding number of 
decoders, each of which provides an output signal 
indicative of the pipeline identifiers; and using the 
decoder output signals to control the switches be- 
tween the first set of connectors and the second set 
of connectors. 

38. A computing system in which groups of instruc- 
tions are issued in parallel to processing pipelines 
(0, ..., 7), the computing system comprising:

storage means (50; 70, 74) for holding an in- 
struction frame, the instruction frame including 
a plurality of instructions including at least one 
group of instructions, the instructions in the at 
least one group of instructions to be issued in 
parallel, the at least one group being deter- 
mined at compile time. 

wherein the instruction frame including data as- 
sociated with the plurality of instructions in the 
instruction frame, the data indicative of which 
instructions are included in the one group of in- 
structions and further indicative of processing 
pipelines (0, ..., 7) appropriate for the plurality
of instructions in the instruction frame; and by 
means (60; 100) responsive to the data associated
with the plurality of instructions for issuing
the instructions in the one group of instructions
in parallel to processing pipelines (0, ...,
7) appropriate for the instructions in the one
group of instructions.

39. The computing system of embodiment 38, 
wherein at least one group of instructions compris- 
es at least two or at least three instructions. 

40. The computing system of embodiment 38 or 39,
wherein the plurality of instructions also includes at 
least another instruction belonging to another group 
of instructions, the another instruction to be issued 
after the at least one group of instructions has been 
issued. 

41. The computing system of one of the embodiments
38 to 40, wherein the processing pipelines
(0, ..., 7) appropriate for the instructions in the one
group of instructions are respectively coupled to
execution units appropriate for the instructions in the
one group of instructions.

42. The computing system of one of the embodiments
38 to 41, wherein an execution unit appropriate
for one instruction in the one group of instructions
is a memory or an arithmetic logic unit or a
floating point unit or a branch unit. 

43. The computing system of one of the embodiments
38 to 42, wherein an execution unit appropriate
for a first instruction in the one group of instructions
and an execution unit appropriate for a second
instruction in the one group of instructions are similar.

44. The computing system of one of the embodi- 
ments 38 to 43, wherein the plurality of instructions 
in the instruction frame are determined at compile 
time. 

45. A method for issuing groups of instructions in
parallel to processing pipelines (0, ..., 7), the method
comprising:

storing in a storage means (50; 70, 74) an instruction
frame, the instruction frame including
a plurality of instructions including at least one
group of instructions, the instructions in the at
least one group of instructions to be issued in
parallel, the at least one group being determined
at compile time,
wherein the instruction frame includes data associated
with the plurality of instructions in the
instruction frame, the data indicative of which
instructions are included in the one group of
instructions and further indicative of processing
pipelines (0, ..., 7) appropriate for the plurality
of instructions in the instruction frame; and
the one group of instructions is issued in parallel
to processing pipelines (0, ..., 7) appropriate
for the instructions in the one group of instructions
in response to the data associated with
the plurality of instructions in the instruction
frame.

46. The method of embodiment 45, further comprising
a compilation step, during which the plurality of
instructions in the instruction frame and the data
associated with the plurality of instructions in the
instruction frame are determined.

47. The method of embodiment 45 or 46, wherein
another instruction of the frame is issued to a
processing pipeline (0, ..., 7) appropriate for the
another instruction in response to the data associated
with the plurality of instructions in the instruction
frame following the step of issuing the one group of
instructions.

48. The method of one of the embodiments 45 to
47, wherein an instruction in the group is issued to
an arithmetic logic unit or a floating point unit or a
memory unit.

49. The method of embodiment 48, wherein another 
instruction in the group is issued to an arithmetic 
logic unit or a floating point unit or a memory unit. 

50. A processor for operating a computing system 
according to the method of one of the embodiments 
45 to 49, wherein 

at least two processing pipelines (0, ..., 7);
storage means (50; 70, 74) for holding at least
one group of instructions to be issued in parallel,
and for holding data associated with the one
group of instructions, the data indicative of
processing pipelines (0, ..., 7) appropriate for
the instructions in the at least one group of
instructions; and
means (60; 100) responsive to the data for
issuing the instructions in the one group of
instructions in parallel to processing pipelines
(0, ..., 7) appropriate for the instructions in the
one group of instructions, are provided for.

Figure 1 is a block diagram illustrating a preferred 
implementation of this invention; 
Figure 2 is a diagram illustrating the data structure 
of an instruction word in this system; 
Figure 3 is a diagram illustrating a group of instruc- 
tion words; 

Figure 4 is a diagram illustrating a frame containing 
from one to eight groups of instructions; 
Figure 5a illustrates the frame structure for one 
maximum-sized group of eight instructions; 
Figure 5b illustrates the frame structure for a typical
mix of three intermediate sized groups of instructions;

Figure 5c illustrates the frame structure for eight 
minimum-sized groups, each of one instruction; 
Figure 6 illustrates an instruction word after prede- 
coding; 

Figure 7 illustrates the operation of the predecoder; 
Figure 8 is a diagram illustrating the overall struc- 
ture of the instruction cache; 
Figure 9 is a diagram illustrating the manner in 
which frames are selected from the instruction 
cache; 

Figure 10 is a diagram illustrating the group selec- 
tion function in the associative crossbar; 
Figure 11 is a diagram illustrating the group dis- 
patch function in the associative crossbar; 
Figure 12 is a diagram illustrating a hypothetical
frame of instructions;
Figure 13 is a diagram illustrating the manner in
which the groups of instructions in Figure 12 are issued
on different clock cycles;
Figure 14 is a diagram illustrating another embodiment
of the associative crossbar; and
Figure 15 is a diagram illustrating the group select
function in further detail.

[0020] Figure 1 is a block diagram of a computer system
according to the preferred embodiment of this invention.
Figure 1 illustrates the organization of the integrated
circuit chips by which the computing system is
formed. As depicted, the system includes a first integrated
circuit 10 that includes a central processing unit, a
floating point unit, and an instruction cache.
[0021] In the preferred embodiment the instruction 
cache is a 16 kilobyte two-way set-associative 32 byte
line cache. A set associative cache is one in which the 
lines (or blocks) can be placed only in a restricted set of 
locations. The line is first mapped into a set, but can be 
placed anywhere within that set. In a two-way set asso- 
ciative cache, two sets, or compartments, are provided, 
and each line can be placed in one compartment or the 
other. 

[0022] The system also includes a data cache chip 20 
that comprises a 32 kilobyte four-way set-associative 32 
byte line cache. The third chip 30 of the system includes
a predecoder, a cache controller, and a memory control- 
ler. The predecoder and instruction cache are explained 
further below. For the purposes of this invention, the 
CPU, FPU, data cache, cache controller and memory 
controller all may be considered of conventional design. 
[0023] The communication paths among the chips are 
illustrated by arrows in Figure 1. As shown, the CPU/ 
FPU and instruction cache chip communicates over a 
32 bit wide bus 12 with the predecoder chip 30. The as- 
terisk is used to indicate that these communications are 
multiplexed so that a 64 bit word is communicated in two 
cycles. Chip 10 also receives information over 64 bit 
wide buses 14, 16 from the data cache 20, and supplies
information to the data cache 20 over three 32 bit wide 
buses 18. 

[0024] The specific functions of the predecoder are 
described in much greater detail below; however, es- 
sentially it functions to decode a 32 bit instruction re- 
ceived from the secondary cache into a 64 bit word, and 
to supply that 64 bit word to the instruction cache on chip 
10. 

[0025] The cache controller on chip 30 is activated 
whenever a first level cache miss occurs. Then the 
cache controller either goes to main memory or to the 
secondary cache to fetch the needed information. In the
preferred embodiment the secondary cache lines are 32
bytes and the cache has an 8 kilobyte page size. 
[0026] The data cache chip 20 communicates with the 
cache controller chip 30 over another 32 bit wide bus. 
In addition, the cache controller chip 30 communicates 
over a 64 bit wide bus 32 with the DRAM memory, over 
a 128 bit wide bus 34 with a secondary cache, and over
a 64 bit wide bus 36 to input/output devices. 
[0027] As will be described, the system shown in Fig- 
ure 1 includes both conventional and novel features. 
The system includes multiple pipelines able to operate 
in parallel on separate instructions. The instructions that
can be dispatched to these parallel pipelines simultaneously,
in what we term "instruction groups," have been
identified by the compiler and tagged with a group iden- 
tification tag. Thus, the group tag designates instruc- 
tions that can be executed simultaneously. Instructions 
within the group are also tagged with a pipeline tag in- 
dicative of the specific pipeline to which that instruction 
should be dispatched. This operation is also performed 
by the compiler. 

[0028] In this system, each group of instructions can 
contain an arbitrary number of instructions ordered in 
an arbitrary sequence. The only limitation is that all in- 
structions in the group must be capable of simultaneous 
execution; e.g., there cannot be data dependency be- 
tween instructions. The instruction groups are collected 
into larger sets and are organized into fixed width 
"frames" and stored. Each frame can contain a variable 
number of tightly packed instruction groups, depending 
upon the number of instructions in each group and on 
the width of the frame. 

[0029] Below we describe this concept more fully, as 
well as describe a mechanism to route in parallel each 
instruction in an arbitrarily selected group to its appro- 
priate pipeline, as determined by the pipeline tag of the 
instruction. 

[0030] In the following description of the word, group, 
and frame concepts mentioned above, specific bit and 
byte widths are used for the word, group and frame. It 
should be appreciated that these widths are arbitrary, 
and can be varied as desired. None of the general mech- 
anisms described for achieving the result of this inven- 
tion depends upon the specific implementation. 
[0031] In one embodiment of this system the central 
processing unit includes eight functional units and is ca- 
pable of executing eight instructions in parallel. We des- 
ignate these pipelines using the digits 0 to 7. Also, for 
this explanation each instruction word is 32 bits (4 bytes)
long, with one bit, for example the high order bit S,
reserved as a flag for group identification. Figure 2
therefore shows the general format of all instructions. 
As shown by Figure 2, bits 0 to 30 represent the instruc- 
tion, with the high order bit 31 reserved to flag groups 
of instructions, i.e., collections of instructions the com- 
piler has determined may be executed in parallel. 
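
By way of illustration only, the word format of Figure 2 can be sketched in Python. The names `split_word` and `tag_word` are hypothetical and are not part of the described system; the sketch assumes only what is stated above, namely a 32 bit word with the instruction in bits 0 to 30 and the S flag in bit 31.

```python
S_BIT = 31                      # high order bit reserved as the group flag
INSTR_MASK = (1 << S_BIT) - 1   # bits 0 to 30 carry the instruction


def split_word(word):
    """Return (s_bit, instruction_bits) for one 32 bit instruction word."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    return (word >> S_BIT) & 1, word & INSTR_MASK


def tag_word(instruction_bits, s_bit):
    """Combine 31 instruction bits with the group flag chosen at compile time."""
    if not 0 <= instruction_bits <= INSTR_MASK:
        raise ValueError("instruction must fit in 31 bits")
    if s_bit not in (0, 1):
        raise ValueError("s_bit must be 0 or 1")
    return (s_bit << S_BIT) | instruction_bits
```

For example, `split_word(tag_word(0x1234, 1))` recovers the flag and the instruction bits unchanged.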
[0032] Figure 3 illustrates a group of instructions. A 
group of instructions consists of one to eight instructions 
(because there are eight pipelines in the preferred implementation)
ordered in any arbitrary sequence, each
of which can be dispatched to a different parallel pipeline
simultaneously. 

[0033] Figure 4 illustrates the structure of an instruc- 
tion frame. In the preferred embodiment an instruction 
frame is 32 bytes wide and can contain up to eight in- 
struction groups, each comprising from one to eight in- 
structions. This is explained further below. 
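
As an illustration only, the placement of groups into 32 byte frames can be sketched in Python. The sketch assumes a simple first-fit policy in issue order and assumes that a group is never split across two frames; `pack_groups` and `FRAME_WORDS` are hypothetical names, and a compiler for this system is not required to pack frames in this way.

```python
FRAME_WORDS = 8  # a 32 byte frame holds eight 4 byte instruction words


def pack_groups(group_sizes):
    """Pack groups, given as instruction counts in issue order, into frames.

    Assumption for illustration: first-fit packing in issue order, with a
    new frame started whenever the next group does not fit.
    """
    frames = []
    free = 0  # unused word slots in the frame being filled
    for size in group_sizes:
        if not 1 <= size <= FRAME_WORDS:
            raise ValueError("a group holds from one to eight instructions")
        if size > free:
            frames.append([])   # start a new frame
            free = FRAME_WORDS
        frames[-1].append(size)
        free -= size
    return frames
```

Under these assumptions, groups of four, three and two instructions occupy two frames, the first of which is left with one unused word.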
[0034] When the instruction stream is compiled before
execution, the compiler places instructions in the
same group next to each other in any order within the
group and then places that group in the frame. The
instruction groups are ordered within the frame from left
to right according to their issue sequence. That is, of the
groups of instructions in the frame, the first group to
issue is placed in the leftmost position, the second group
to issue is placed in the next position to the right, etc.
Thus, the last group of instructions to issue within that
frame will be placed in the rightmost location in the
frame. As explained, the group affiliation of all instructions
in the same group is indicated by setting the S bit
(bit 31 in Figure 2) to the same value. This value toggles
back and forth from 0 to 1 to 0, etc., between adjacent
groups to thereby identify the groups. Thus, all instructions
in the first group in a frame have the S bit set to 0,
all instructions in the second group have the S bit set to
1, all instructions in the third group have the S bit set to
0, etc., for all groups of instructions in the frame.
[0035] To clarify the use of a frame, Figure 5 illustrates
three different frame structures for different hypothetical
groups of instructions. In Figure 5a the frame structure
for a group of eight instructions, all of which can be
issued simultaneously, is shown. The instruction words
are designated W0, W1, ..., W7. The S bit for each one
of the instruction words has been set to 0 by the compiler,
thereby indicating that all eight instructions can be
issued simultaneously.

[0036] Figure 5b illustrates the frame structure for a
typical mixture of three intermediate sized groups of
instructions. In Figure 5b these three groups of instructions
are designated Group 0, Group 1 and Group 2.
Shown at the left-hand side of Figure 5b is Group 0 that
consists of two instruction words W0 and W1. The S
bits for each of these instructions have been set to 0.
Group 1 of instructions consists of three instruction
words, W2, W3 and W4, each having the S bit set to 1.
Finally, Group 2 consists of three instruction words, W5,
W6 and W7, each having its S bit set to 0.
[0037] Figure 5c illustrates the frame structure for 
eight minimum sized groups, each consisting of a single 
instruction. Because each "group" of a single instruction 
must be issued before the next group, the S bits toggle 
in a sequence 01010101 as shown. 
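
The toggling rule can be sketched in Python as follows. This is an illustration only; `s_bits_for_frame` is a hypothetical name, and the sketch assumes, as stated above, that the first group in a frame has its S bit set to 0 and that a frame holds at most eight words.

```python
def s_bits_for_frame(group_sizes):
    """Return the S bit of each word in a frame, left to right.

    group_sizes lists how many instructions each group holds, in issue order.
    """
    if sum(group_sizes) > 8 or any(n < 1 for n in group_sizes):
        raise ValueError("a frame holds at most eight words in non-empty groups")
    bits = []
    for index, size in enumerate(group_sizes):
        bits.extend([index % 2] * size)  # the flag toggles between adjacent groups
    return bits
```

Called with `[8]`, `[2, 3, 3]` and `[1] * 8`, the function reproduces the S bit patterns described for Figures 5a, 5b and 5c respectively.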
[0038] As briefly mentioned above, in the preferred
embodiment the group identifiers are associated with
individual instructions in a group during compilation. In the
preferred embodiment, this is achieved by compiling the
instructions to be executed using a well-known compiler
technology. During the compilation, the instructions are
checked for data dependencies, dependence upon previous
branch instructions, or other conditions that preclude
their execution in parallel with other instructions.
These steps are performed using a well-known compiler.
The result of the compilation is a group identifier being
associated with each instruction. It is not necessary
that the group identifier be added to the instruction as a
tag, as shown in the preferred embodiment and described
further below. In an alternative approach, the
group identifier is provided as a separate tag that is later
associated with the instruction. This makes possible the
execution of programs on our system, without need to
revise the word width.

[0039] In addition, in some embodiments the compiler
will determine the appropriate pipeline for execution of
an individual instruction. This determination is essentially
a determination of the type of instruction provided. For
example, load instructions will be sent to the load pipeline,
store instructions to the store pipeline, etc. The
association of the instruction with the given pipeline can be
achieved either by the compiler, or by later examination
of the instruction itself, for example during predecoding.
[0040] Referring again to Figure 1, in normal operation
the CPU will execute instructions from the instruction
cache, according to well-known principles. On an
instruction cache miss, however, the entire frame con- 
taining the instruction missed is transferred from the 
main memory into the secondary cache and then into 
the primary instruction cache, or from the secondary 
cache to the primary instruction cache, where it occu- 
pies one line of the instruction cache memory. Because 
instructions are only executed out of the instruction 
cache, all instructions ultimately undergo the following 
procedure. 

[0041] At the time a frame is transferred into the in- 
struction cache, the instruction word in that frame is pre- 
decoded by the predecoder 30 (Figure 1), which as is 
explained below decodes the retrieved instruction into 
a full 64 bit word. As part of this predecoding the S bit 
of each instruction is expanded to a full 3 bit field 000,
001, ..., 111, which provides the explicit binary group
number of the instruction. In other words, the predecoder,
by expanding the S bit to a three bit sequence,
explicitly provides information that instruction group
000 must execute before instruction group 010, although
all instructions within both groups have their S bits
set to 0. Because of the frame
rules for sequencing groups, these group numbers cor- 
respond to the order of issue of the groups of instruc- 
tions. Group 0 (000) will be issued first, Group 1 (001), 
if present, will be issued second, Group 2 (010) will be 
issued third. Ultimately, Group 7 (111), if present, will be 
issued last. At the time of predecoding of each instruc- 
tion, the S value of the last word in the frame, which 
belongs to the last group in the frame to issue, is stored 
in the tag field for that line in the cache, along with the 
19 bit real address and a valid bit. The valid bit is a bit 
that specifies whether the information in that line in the 
cache is valid. If the bit is not set to "valid," there cannot 
be a match or "hit" on this address. The S value from 
the last instruction, which S value is stored in the tag 
field of the line in the cache, provides a "countdown" val- 
ue that can be used to know when to increment to the 
next cache line. 
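
The expansion of the one bit S flag into an explicit group number can be sketched in Python. This is an illustration only and is not the predecoder circuit itself; `expand_group_numbers` is a hypothetical name, and the sketch relies only on the frame rule that the flag toggles between adjacent groups.

```python
def expand_group_numbers(s_bits):
    """Turn the per-word S flags of one frame into explicit group numbers."""
    numbers = []
    group = 0
    for position, flag in enumerate(s_bits):
        if position > 0 and flag != s_bits[position - 1]:
            group += 1  # a toggle marks the start of the next group
        numbers.append(group)
    return numbers
```

For the frame of Figure 5b, whose S flags are 0, 0, 1, 1, 1, 0, 0, 0, the words receive group numbers 0, 0, 1, 1, 1, 2, 2, 2, so the first and third groups are distinguished even though both carry an S flag of 0. The last entry of the result corresponds to the value stored in the tag field for the line.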

[0042] As another part of the predecoding process, a 
new 4 bit field prefix is added to each instruction giving 
the explicit pipe number of the pipeline to which that instruction
will be routed. The use of four bits, rather than
three, allows the system to be later expanded with additional
pipelines. Thus, at the time an instruction is supplied
from the predecoder to the instruction cache, each
instruction will have the format shown in Figure 6. As 
shown by Figure 6, bits 0 to 56 provide 57 bits for the 
instruction, bits 57, 58 and 59 form the full 3 bit S field, 
and bits 60-63 provide the 4 bit P field. 
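
The 64 bit format of Figure 6 can be illustrated with a short Python sketch. The function names are hypothetical; the bit positions are those given above (instruction in bits 0 to 56, S field in bits 57 to 59, P field in bits 60 to 63).

```python
INSTR_BITS = 57  # bits 0 to 56 hold the instruction
S_SHIFT = 57     # bits 57 to 59 hold the 3 bit S field
P_SHIFT = 60     # bits 60 to 63 hold the 4 bit P field


def pack_predecoded(instruction, group, pipe):
    """Build one 64 bit predecoded word from its three fields."""
    if not 0 <= instruction < (1 << INSTR_BITS):
        raise ValueError("instruction must fit in 57 bits")
    if not 0 <= group < 8:
        raise ValueError("group number must fit in 3 bits")
    if not 0 <= pipe < 16:
        raise ValueError("pipe number must fit in 4 bits")
    return (pipe << P_SHIFT) | (group << S_SHIFT) | instruction


def unpack_predecoded(word):
    """Return (instruction, group, pipe) from a 64 bit predecoded word."""
    instruction = word & ((1 << INSTR_BITS) - 1)
    group = (word >> S_SHIFT) & 0b111
    pipe = (word >> P_SHIFT) & 0b1111
    return instruction, group, pipe
```

Packing and then unpacking a word returns the original three fields, and the largest packed value fits in 64 bits.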
[0043] Figure 7 illustrates the operation of the prede- 
coder in transferring a frame from memory to the instruc- 
tion cache. In the upper portion of Figure 7, the frame is 
shown with a hypothetical four groups of instructions. 
The first group consists of a single instruction, the sec- 
ond group of three instructions, and each of the third 
and fourth groups of two instructions. As described, each
instruction is 32 bits in length and includes an S bit to separate
the groups. The predecoder decodes the instructions
shown in the upper portion of Figure 7 into the instructions
shown in the lower portion of Figure 7. As
shown, the instructions are expanded to 64 bit length, 
with each instruction including a 4 bit identification of the 
pipeline to which the instruction is to be assigned, and 
the expanded group field to designate the groups of in- 
structions that can be executed together. For illustration, 
hypothetical pipeline tags have been applied. Addition- 
ally, the predecoder examines each frame for the mini- 
mum number of clocks required to execute the frame, 
and that number is appended to the address tag 45 for 
the line. The address tag consists of bits provided for 
the real address for the line, 1 bit to designate the validity 
of the frame, and 3 bits to specify the minimum time in 
number of clock cycles, for that frame to issue. The 
number of clocks for the frame to issue is determined 
by the group identification number of the last word in the 
frame. At this stage, the entire frame shown in the lower 
portion of Figure 7 is present in the instruction cache. 
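
The address tag described above can be modeled in Python as a small record. This is a sketch under stated assumptions: `LineTag` and `make_tag` are hypothetical names, and the minimum clock count is taken to be the group number of the last word plus one, that is, one clock per group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineTag:
    real_address: int  # high order real address bits for the line
    valid: bool        # whether the frame in this line is valid
    last_group: int    # 3 bit group number of the last word in the frame

    @property
    def min_clocks(self):
        # Assumption: each group needs at least one clock to issue.
        return self.last_group + 1


def make_tag(real_address, group_numbers):
    """Build the tag for a frame from the group numbers of its words."""
    if not group_numbers:
        raise ValueError("a frame holds at least one instruction")
    return LineTag(real_address, True, group_numbers[-1])
```

For the hypothetical frame of Figure 7, with groups of one, three, two and two instructions, the last word belongs to group 3 and the sketch reports a minimum of four clocks.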
[0044] It may be desirable to implement the system of 
this invention on computer systems that already are in 
existence and therefore have instruction structures that 
have already been defined without fields for the group 
information, pipeline information, or both. In this case, in
another embodiment of this invention, the group and
pipeline information is supplied on a different clock cycle,
then combined with the instructions in the cache.
Such an approach can be achieved by adding a "no-op"
instruction with fields that identify which instructions are
in which group, and identify the pipeline for execution of 
the instruction, or by supplying the information relating 
to the parallel instructions in another manner. It there- 
fore should be appreciated that the manner in which the 
data arrives at the crossbar to be processed is some- 
what arbitrary. We use the word "associated" herein to 
designate the concept that the pipeline and group iden- 
tifiers are not required to have a fixed relationship to the 
instruction words. That is, the pipeline and group identifiers
need not be embedded within the instructions
themselves as shown in Figure 7. Instead they may ar- 
rive from another means, or on a different cycle. 
[0045] Figure 8 is a simplified diagram illustrating the
secondary cache, the predecoder, and the instruction
cache. This drawing, together with Figures 9, 10 and 11,
is used to explain the manner in which the instructions
tagged with the P and S fields are routed to their designated
instruction pipelines.

[0046] In Figure 8 instruction frames are fetched in a
single transfer across a 256 bit (32 byte) wide path from
a secondary cache 50 into the predecoder 60. As explained
above, the predecoder expands each 32 bit instruction
in the frame to its full 64 bit wide form and prefixes
the P and S fields. After predecoding, the 512 bit
wide instruction frame is transferred into the primary
instruction cache 70. At the same time, the tag is placed
into the tag field 74 for that line.

[0047] The instruction cache operates as a conventional
physically-addressed instruction cache. In the example
depicted in Figure 8, the instruction cache will
contain 512 bit fully-expanded instruction frames of
eight instructions each organized in two compartments
of 256 lines.

[0048] Address sources for the instruction cache arrive
at a multiplexer 80 that selects the next address to
be fetched. Because instructions are always machine
words, the low order two address bits <1:0> of the 32
bit address field supplied to multiplexer 80 are discarded.
These two bits designate byte and half-word boundaries.
Of the remaining 30 bits, the next three low order
address bits <4:2>, which designate a particular instruction
word in a frame, are sent directly via bus 81 to the
associative crossbar (explained in conjunction with
subsequent figures). The next low eight address bits <12:5>
are supplied over bus 82 to the instruction cache 70
where they are used to select one of the 256 lines in the
instruction cache. Finally, the remaining 19 bits of the
virtual address <31:13> are sent to the translation lookaside
buffer (TLB) 90. The TLB translates these bits into
the high 19 bits of the physical address. The TLB then
supplies them over bus 84 to the instruction cache. In
the cache they are compared with the tag of the selected
line, to determine if there is a "hit" or a "miss" in the
instruction cache.
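
The division of the 32 bit address into its fields can be illustrated in Python. The function name `split_fetch_address` is hypothetical; the bit ranges are those stated above.

```python
def split_fetch_address(address):
    """Split a 32 bit fetch address into the fields used by the cache.

    Returns (virtual_page, line_index, word_in_frame). Bits <1:0> are
    discarded because instructions are always whole machine words.
    """
    if not 0 <= address < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    word_in_frame = (address >> 2) & 0x7       # bits <4:2>, sent to the crossbar
    line_index = (address >> 5) & 0xFF         # bits <12:5>, select 1 of 256 lines
    virtual_page = (address >> 13) & 0x7FFFF   # bits <31:13>, sent to the TLB
    return virtual_page, line_index, word_in_frame
```

The three fields together with the two discarded bits account for all 32 address bits (19 + 8 + 3 + 2).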

[0049] If there is a hit in the instruction cache, indicating
that the addressed instruction is present in the
cache, then the selected frame containing the addressed
instruction is transferred across the 512 bit wide
bus 73 into the associative crossbar 100. The associative
crossbar 100 then dispatches the addressed instruction,
with the other instructions in its group, if any,
to the appropriate pipelines over buses 110, 111, ..., 117.
Preferably the bit lines from the memory cells containing
the bits of the instruction are themselves coupled to the
associative crossbar. This eliminates the need for numerous
sense amplifiers, and allows the crossbar to operate
on the lower voltage swing information from the
cache line directly, without the normally intervening driver
circuitry to slow system operation.
[0050] Figure 9 is a block diagram illustrating in more
detail the frame selection process. As shown, bits <4:2>
of the virtual address are supplied directly to the
associative crossbar 100 over bus 81. Bus 81, as explained
above, will preferably include a pair of conductors,
the bit lines, for each data bit in the field. Bits <12:5>
supplied over bus 82 are used to select a line in the
instruction cache. The remaining 19 bits, translated into
the 19 high order bits <31:13> of physical address, are
used to compare against the tags of the two selected
lines (one from each compartment of the cache) to determine
if there is a hit in either compartment. If there is
a hit, the two 512 bit wide frames are supplied to multiplexer
120. The choice of which line is ultimately supplied
to associative crossbar 100 depends upon the real
address bits <31:13> that are compared by comparators
125. The output from comparators 125 thus selects the
appropriate frame for transfer to the crossbar 100.
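
The frame selection of Figure 9 can be modeled in Python as follows. This is a behavioral sketch only; `select_frame` and the tuple layout of a cache line are assumptions made for illustration.

```python
def select_frame(compartments, line_index, real_tag):
    """Return the frame whose tag matches, or None on a miss.

    compartments is a pair of line arrays; each line is a
    (valid, tag, frame) tuple standing in for one cache line.
    """
    for lines in compartments:
        valid, tag, frame = lines[line_index]
        if valid and tag == real_tag:  # comparator output selects this compartment
            return frame
    return None
```

One line is read from each compartment at the same index, and the tag comparison decides which of the two frames, if either, is passed on.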
[0051] Figure 10 illustrates in more detail the group
select function of the associative crossbar. A 512 bit
wide register 130, preferably formed by the SRAM cells
in the instruction cache, contains the frame of the instructions
to be issued. For the purposes of illustration, register
130 is shown as containing a frame having three
groups of instructions, with Group 0 including words
W0, W1 and W2; Group 1 containing words W3, W4 and
W5; and Group 2 containing words W6 and W7. For
illustration, the instructions in Group 0 are to be dispatched
to pipelines 1, 2 and 3; the instructions in Group
1 to pipelines 1, 3 and 6; and the instructions in Group
2 to pipelines 1 and 6. The three S bits (group identification
field) of each instruction in the frame are brought
out to an 8:1 multiplexer 140 over buses 131, 132,
133, ..., 138. The S field of the next group of instructions
to be executed is present in a 3 bit register 145. As
shown in Figure 10, the hypothetical contents of register
145 are 011. These bits have been loaded into register
145 using bus 81 described in conjunction with Figure
9. Multiplexer 140 then compares the value in this register
against the contents of the S field in each of the
instruction words. If the two values match, the appropriate
decoder 150 is enabled, permitting the instruction
word to be processed on that clock cycle. If the values
do not match, the decoder is disabled and the instruction
words are not processed on that clock cycle. In the example
depicted in Figure 10, the contents of register 145
match the S field of the Group 1 instructions. The resulting
output, supplied over bus 142, is communicated to
S register 144 and then to the decoders via bus 146.
The S register contents enable decoders 153, 154 and
155, all of which are in Group 001. As will be shown in
Figure 11, this will enable these instructions W3, W4 and
W5 to be sent to the pipelines for processing.
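
One possible reading of the group select function can be sketched in Python. The sketch assumes that the value loaded over bus 81 (011 in Figure 10) selects word W3, that the 8:1 multiplexer passes the S field of that word to the S register, and that each decoder is enabled when its own S field equals that value. `select_group` is a hypothetical name, and this interpretation is an assumption rather than a statement of the circuit.

```python
def select_group(s_fields, word_index):
    """Return (selected_group, enables) for one frame.

    s_fields holds the 3 bit group number of each of the eight words;
    word_index is the 3 bit value taken from address bits <4:2>.
    """
    selected = s_fields[word_index]  # output of the 8:1 multiplexer
    enables = [s == selected for s in s_fields]
    return selected, enables
```

With the frame of Figure 10 and a word index of 3, the selected group is 1 and only the decoders for W3, W4 and W5 are enabled.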

[0052] Figure 11 is a block diagram illustrating the
group dispatching of the instructions in the group to be
executed. The same registers are shown across the upper
portion of Figure 11 as in the lower portion of Figure
10. As shown in Figure 11, the crossbar switch itself consists
of two sets of crossing pathways. In the horizontal
direction are the pipeline pathways 180, 181, ..., 187.
In the vertical direction are the instruction word paths 190,
191, ..., 197. Each of these pipeline and instruction pathways
is itself a bus for transferring the instruction
word. Each horizontal pipeline pathway is coupled to a
pipeline execution unit 200, 201, 202, ..., 207. Each of the
vertical instruction word pathways 190, 191, ..., 197 is
coupled to an appropriate portion of register 130 (Figure
10).

[0053] The decoders 170, 171, ..., 177 associated
with each instruction word pathway receive the 4 bit
pipeline code from the instruction. Each decoder, for example
decoder 170, provides as output eight 1 bit control
lines. One of these control lines is associated with
each pipeline pathway crossing of that instruction word
pathway. Selection of a decoder as described with reference
to Figure 10 activates the output bit control line
corresponding to that input pipe number. This signals
the crossbar to close the switch between the word path
associated with that decoder and the pipe path selected
by that bit line. Establishing the cross connection between
these two pathways causes a selected instruction
word to flow into the selected pipeline. For example, decoder
173 has received the pipeline bits for word W3.
Word W3 has associated with it pipeline path 1. The
pipeline path 1 bits are decoded to activate switch 213
to supply instruction word W3 to pipeline execution unit
201 over pipeline path 181. In a similar manner, the identification
of pipeline path 3 for decoder D4 activates
switch 234 to supply instruction word W4 to pipeline path
3. Finally, the identification of pipeline 6 for word W5 in
decoder D5 activates switch 265 to transfer instruction
word W5 to pipeline execution unit 206 over pipeline
pathway 186. Thus, instructions W3, W4 and W5 are
executed by pipes 201, 203 and 206, respectively.
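
The dispatch through the crossbar can be modeled behaviorally in Python. The sketch is an illustration only; `dispatch` is a hypothetical name, each enabled decoder is modeled as producing eight one bit control lines of which exactly one is active, and the check for two words targeting one pipeline is an added assumption.

```python
def dispatch(words, pipes, enables, n_pipes=8):
    """Route each enabled word to the pipeline named by its P field.

    Returns a list with one entry per pipeline: the word routed to it,
    or None if no word was routed to that pipeline on this clock.
    """
    pipelines = [None] * n_pipes
    for word, pipe, enabled in zip(words, pipes, enables):
        if not enabled:
            continue  # a disabled decoder closes no switch
        control_lines = [int(path == pipe) for path in range(n_pipes)]
        for path, closed in enumerate(control_lines):
            if closed:
                if pipelines[path] is not None:
                    raise ValueError("two words in a group target one pipeline")
                pipelines[path] = word
    return pipelines
```

Using the hypothetical pipeline assignments of Figure 10 with the Group 1 decoders enabled, W3, W4 and W5 arrive at pipelines 1, 3 and 6 and the remaining pipelines receive nothing.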
[0054] The pipeline processing units 200, 201, ..., 207
shown in Figure 11 can carry out desired operations. In 
a preferred embodiment of the invention, each of the 
eight pipelines first includes a sense amplifier to detect 
the state of the signals on the bit lines. In one embodi- 
ment the pipelines include first and second arithmetic 
logic units; first and second floating point units; first and 
second load units; a store unit and a control unit. The 
particular pipeline to which a given instruction word is 
dispatched will depend upon hardware constraints as 
well as data dependencies. 

[0055] Figure 12 is an example of a frame and how it
will be executed by the pipeline processors 200-207 of
Figure 11. As shown in Figure 12, the frame includes
three groups of instructions. The first group, with group
identification number 0, includes two instructions that
can be executed by the arithmetic logic unit, a load
instruction and a store instruction. Because all these
instructions have been assigned the same group
identification number by the compiler, all four
instructions can execute in parallel. The second group of
instructions consists of a single load instruction and
two floating point instructions. Again, because each of
these instructions has been assigned "Group 1," all three
instructions can be executed in parallel. Finally, the
last instruction word in the frame is a branch
instruction that, based upon the compiler's decision,
must be executed last.
[0056] Figure 13 illustrates the execution of the instructions
in the frame shown in Figure 12. As shown,
during the first clock the Group 0 instructions execute, 
during the second clock the load and floating point in- 
structions execute, and during the third clock the branch 
instruction executes. To prevent groups from being split 
across two instruction frames, an instruction frame may 
be only partially filled, where the last group is too large 
to fit entirely within the remaining space of the frame. 
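
The clock-by-clock behaviour of Figures 12 and 13, and the rule that a group is never split across frames, can be illustrated with a short sketch. The representation of a frame as (operation, group tag) pairs and the function names `schedule` and `pack_groups` are assumptions made for this illustration. The greedy packing strategy is likewise an assumption, since the text does not say how the compiler fills frames.

```python
from itertools import groupby

# Frame of Figure 12 as (operation, group tag) pairs.
FRAME = [
    ("ALU", 0), ("ALU", 0), ("LOAD", 0), ("STORE", 0),
    ("LOAD", 1), ("FP", 1), ("FP", 1),
    ("BRANCH", 2),
]


def schedule(frame):
    """Return one list of operations per clock, one clock per group."""
    return [[op for op, _ in members]
            for _, members in groupby(frame, key=lambda slot: slot[1])]


def pack_groups(group_sizes, frame_size=8):
    """Greedily place whole groups into frames, never splitting a group.

    A frame is left partially filled when the next group does not fit.
    """
    frames, current, used = [], [], 0
    for size in group_sizes:
        if size > frame_size:
            raise ValueError("group larger than a frame")
        if used + size > frame_size:
            frames.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        frames.append(current)
    return frames


clocks = schedule(FRAME)
for number, ops in enumerate(clocks, start=1):
    print(f"CLOCK {number}: {' '.join(ops)}")

# Groups of 4, 3, 2 and 1 instructions: the first frame holds only 7 words.
packed = pack_groups([4, 3, 2, 1])
print(packed)
```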
[0057] Figure 14 is a diagram illustrating another
embodiment of the associative crossbar. In Figure 14 nine
pipelines 0 - 8 are shown coupled to the crossbar. The
three bit program counter PC, which points to one of the
instructions in the frame, is used in combination with
the set of 8 group identification bits for the frame,
which indicate the group affiliation of each instruction,
to enable a subset of the instructions in the frame. The
enabled instructions are those at or above the address
indicated by the PC that belong to the current group.
[0058] The execution ports that connect to the pipe- 
lines specified by the pipeline identification bits of the 
enabled instructions are then selected to multiplex out 
the appropriate instructions from the current frame. If 
one or more of the pipelines is not ready to receive a 
new instruction, a set of hold latches at the output of the 
execution ports prevents any of the enabled instructions 
from issuing until the "busy" pipeline is free. Otherwise 
the instructions pass transparently through the hold 
latches into their respective pipelines. Accompanying 
the output of each port is a "port valid" signal that indicates
whether the port has valid information to issue to
the hold latch. 
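
The all-or-nothing issue behaviour of the hold latches can be sketched as below. This is a simplified model under stated assumptions: the function name `issue` and its data structures are hypothetical, and only the pipelines targeted by enabled instructions are assumed to be checked for readiness, a point the text leaves open.

```python
def issue(enabled, pipe_ready):
    """Model the execution ports and hold latches for one clock.

    enabled    -- dict mapping pipe number to the enabled instruction
    pipe_ready -- list of booleans, True if that pipe can accept work
    Returns (issued, port_valid): the instructions that actually pass
    through the hold latches, and the per-port "port valid" signals.
    """
    num_pipes = len(pipe_ready)
    port_valid = [pipe in enabled for pipe in range(num_pipes)]
    # If any targeted pipe is busy, every enabled instruction is held.
    blocked = any(not pipe_ready[pipe] for pipe in enabled)
    issued = {} if blocked else dict(enabled)
    return issued, port_valid


enabled = {1: "W3", 3: "W4", 6: "W5"}
ready = [True] * 9          # nine pipelines, as in Figure 14
busy = list(ready)
busy[3] = False             # pipe 3 cannot accept a new instruction

issued_now, valid = issue(enabled, ready)
held, _ = issue(enabled, busy)
print(issued_now, held, valid)
```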

[0059] Figure 15 is a diagram illustrating the group
select function in further detail. This figure
illustrates the mechanism used to enable an addressed
group of instructions within a frame. The program counter
is first decoded into a set of 14 one-bit signals. Seven
of these signals are combined with the eight group
identifiers of the current frame to determine whether
each of the seven instructions, I1 to I7, is or is not
the start of a later group. This information can then be
combined with the other 7 one-bit signals from the PC
decoder to determine which of the eight instructions in
the frame should be enabled. Each such enabled
instruction can then signal the execution port, as
determined by its pipeline identifying field, to
multiplex out the enabled instruction. Thus, if I2 is
enabled, and the pipeline code is 5, the select line from
I2 to port 5 is activated, causing I2 to flow to the hold
latch at pipe 5.
[0060] Because the instructions that start later groups
are known, the system can decide easily which instruction
starts the next group. This information is used to update
the PC to the address of the next group of instructions.
If no instruction in the frame begins the next group,
i.e., the last instruction group has been dispatched to
the pipelines, a flag is set. The flag causes the next
frame of instructions to be brought into the crossbar.
The PC is then reset to I0. Shown in the figure is an
exemplary sequence of the values that the PC, the
instruction enable bits and the next frame flag take on
over a sequence of eight clocks extending over two
frames.
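
The group select function and the PC update can be modelled in a few lines. This is a behavioural sketch, not the gate-level logic of Figure 15. The names `group_select` and `run_frame` are hypothetical, and an instruction is assumed to start a later group when its group identifier differs from that of the instruction before it.

```python
def group_select(group_ids, pc):
    """Return (enables, next_pc, next_frame_flag) for one clock.

    group_ids -- group identifier of each instruction I0..I7
    pc        -- index of the first instruction of the current group
    """
    n = len(group_ids)
    # Instructions after the PC that start a later group.
    starts = [i for i in range(pc + 1, n)
              if group_ids[i] != group_ids[i - 1]]
    end = starts[0] if starts else n
    # Enable everything from the PC up to the start of the next group.
    enables = [pc <= i < end for i in range(n)]
    next_frame = not starts           # no later group left in this frame
    next_pc = 0 if next_frame else end  # PC is reset to I0 on a new frame
    return enables, next_pc, next_frame


def run_frame(group_ids):
    """Step through one frame and record (pc, enables, flag) per clock."""
    pc, trace = 0, []
    while True:
        enables, next_pc, next_frame = group_select(group_ids, pc)
        trace.append((pc, enables, next_frame))
        if next_frame:
            return trace
        pc = next_pc


# Group tags of the frame in Figure 12.
trace = run_frame([0, 0, 0, 0, 1, 1, 1, 2])
for pc, enables, flag in trace:
    bits = "".join("1" if e else "0" for e in reversed(enables))  # I7..I0
    print(f"PC={pc} enables={bits} next_frame={flag}")
```

Because the boundary test compares neighbouring identifiers, the same sketch also works if each instruction carries only a single group identification bit that alternates between successive groups.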

[0061] The processor architecture described above 
provides many unique advantages to a system using 
this invention. The system described is extremely flexi- 
ble, enabling instructions to be executed sequentially or 
in parallel, depending entirely upon the "intelligence" of 
the compiler. As compiler technology improves, the de- 
scribed hardware can execute programs more rapidly, 
not being limited to any particular frame width, number 
of instructions capable of parallel execution, or other external
constraints. Importantly, the associative crossbar
aspect of this invention relies upon the content of the
message being decoded, not upon an external control
circuit acting independently of the instructions being executed.
In essence, the associative crossbar is self-directed.
In the preferred embodiment the system is capable
of a parallel issue of up to eight operations per
cycle. For a more complete description of the associa- 
tive crossbar, see EP-A-0 652 509. 
[0062] Although the foregoing has been a description 
of the preferred embodiment of the invention, it will be 
apparent to those of skill in the art that numerous mod- 
ifications and variations may be made to the invention 
without departing from the scope as described therein. 
For example, arbitrary numbers of pipelines, arbitrary 
numbers of decoders, and different architectures may 
be employed, yet rely upon the system we have devel- 
oped. 



Claims 

1. A processor comprising:

a register file having a plurality of registers;
an instruction set including instructions (INSTRUCTION)
which address the registers, each instruction
(INSTRUCTION) being one of a plurality of instruction
types;
a plurality of execution units (0...7), each execution
unit (0...7) being one of a plurality of types, wherein
each instruction type is executed on one or more
execution unit types;
and further wherein the instructions (INSTRUCTION) are
encoded in frames, each frame including a plurality of
instructions (INSTRUCTION) and template bits (S,P)
grouped together in an N-bit field, the instructions
(INSTRUCTION) being located in instruction slots of the
N-bit field, the template bits (S,P) specifying a
mapping of the instruction slots to the execution unit
types.

2. The processor of claim 1 wherein the template bits 
further specify instruction group boundaries within 
the frame, with an instruction group comprising a 
set of statically contiguous instructions that are ex- 
ecuted concurrently. 

3. The processor of claim 1 or 2 wherein the instruction 
types include integer arithmetic logic unit, memory, 
floating-point, and branch instructions. 

4. The processor of one of the claims 1 to 3 wherein 
the execution unit types include integer, memory, 
floating-point, and branch execution units. 

5. The processor of one of the claims 1 to 4 wherein 
the frame further includes a stop-bit that specifies 
an inter-frame instruction group boundary. 

6. The processor according to one of the claims 1 to 
5, further comprising a memory that stores the 
frames, a byte order of the frames in the memory 
being in a little-endian format or in a big-endian for- 
mat. 

7. The processor according to one of the claims 1 to 6
wherein the frames are ordered in the memory from a
lowest to a highest memory address.

8. The processor according to one of the claims 1 to 7
wherein an instruction in the frame with the lowest
memory address precedes an instruction in the frame with
the highest memory address.

9. The processor according to one of the claims 1 to 8
wherein the template bits are partly determined by
hardware.

10. The processor according to one of the claims 1 to 9
wherein the template bits are at least partly determined
at a compile time.

11. A method for operating a processor comprising:

storing a frame of instructions (INSTRUCTION) in a
memory, the frame including a plurality of instructions
and template bits (S, P), the plurality of instructions
(INSTRUCTIONS) and the template bits (S, P) grouped
together in an N-bit field, the instructions
(INSTRUCTION) being located in instruction slots of the
N-bit field, the template bits (S, P) specifying a
mapping of the instruction slots to execution unit
types, each instruction (INSTRUCTION) being one of a
plurality of instruction types from an instruction set
which address a plurality of registers in a register
file; and

executing each instruction type on an execution unit
(0...7) from a plurality of execution units (0...7)
being one of a plurality of execution unit types.

12. The method according to claim 11 comprising the
step of specifying instruction group boundaries within
the frame at the template bits, with an instruction
group comprising a set of statically contiguous
instructions that are executed concurrently.

13. The method according to claim 11 or 12 comprising
the step of including a stop-bit into the frame, said
stop-bit specifying an inter-frame instruction group
boundary.

14. The method according to claim 13 comprising the
step of specifying the instruction group boundary to
occur after a last instruction of a current frame if the
stop-bit is in a first condition.

15. The method according to claim 14 comprising the
step of specifying the instruction group to include the
last instruction of the current frame if the stop-bit is
in a second condition.

16. The method according to one of the claims 11 to 15
comprising the step of storing the frames in a memory
with a byte order in a little-endian or a big-endian
format.

17. The method according to one of the claims 11 to 16
comprising the step of ordering the frames in the memory
from a lowest to a highest memory address.

18. The method according to one of the claims 11 to 17,
wherein the instruction types include integer arithmetic
logic unit, memory, floating point, and branch units.

19. The method according to one of the claims 11 to 18
comprising the step of determining the template bits
partly by hardware.

20. The method according to one of the claims 11 to 19
comprising the step of determining the template bits at
least partly at compile time.

21. A memory comprising:

a frame of instructions (INSTRUCTION), the frame
including a plurality of instructions (INSTRUCTION) and
template bits (S,P), the plurality of instructions
(INSTRUCTION) and the template bits (S,P) grouped
together in an N-bit field, the instructions
(INSTRUCTION) being located in instruction slots of the
N-bit field, the template bits (S,P) specifying a
mapping of the instruction slots to execution unit
types, each instruction (INSTRUCTION) being one of a
plurality of instruction types from an instruction set
and which address a plurality of registers in a register
file,

wherein each instruction type is to be executed on an
execution unit (0...7) from a plurality of execution
units (0...7) being one of a plurality of execution unit
types, and
wherein instructions in the frame may be issued to
respective execution units (0...7) at different times.

22. The memory of claim 21 wherein the template bits
further specify instruction group boundaries within the
frame, with an instruction group comprising a set of
statically contiguous instructions that are executed
concurrently.

23. The memory of claim 21 or 22 wherein the frame
further includes a stop-bit that specifies an
inter-frame instruction group boundary.

24. The memory of claim 23 wherein, if the stop-bit is
in a first condition, the instruction group includes the
last instruction of the current frame.

25. The memory according to one of the claims 21 to 24
wherein a byte order of the frames stored therein is in
a little-endian format or in a big-endian format.

26. The memory according to one of the claims 21 to 25
wherein the frames are ordered from a lowest to a
highest memory address.

27. The memory according to one of the claims 21 to 26
wherein an instruction in the frame with the lowest
memory address precedes an instruction in the frame with
the highest memory address.

28. The memory according to one of the claims 21 to 27
further comprising an unused encoding of the template
bits being available for use in a future extension.

29. The memory according to one of the claims 21 to 28
wherein the template bits are derivable from machine
code.

30. The memory according to one of the claims 21 to 29
wherein the template bits are determinable at least
partly at a compile time.






EP 1 102 166 A2 



FIG. 1
[Block diagram. Labels: CPU/FPU; DATA CACHE, 32 KB FOUR-WAY SET-ASSOCIATIVE, 32 BYTE LINE; INSTRUCTION CACHE, 16 KB TWO-WAY SET-ASSOCIATIVE, 32 BYTE LINE; SECONDARY CACHE; PREDECODER AND CACHE CONTROLLER AND MEMORY CONTROLLER; I/O (586 BUS); PCI BUS; DRAM MEMORY 1M-4M.]






FIG. 2
[Instruction word format with bit positions 31 and 30 marked and an INSTRUCTION field.]



FIG. 3
[INSTRUCTION 1, INSTRUCTION 2, ..., INSTRUCTION n]



FIG. 4
[Frame of eight words W0 W1 W2 W3 W4 W5 W6 W7, 32 BYTES wide.]






FIG. 5A
S =    0   0   0   0   0   0   0   0
       W0  W1  W2  W3  W4  W5  W6  W7
       [ GROUP 0 ]



FIG. 5B
[Frame of words W0 W1 W2 W3 W4 W5 W6 W7 with S bits, divided into GROUP 0, GROUP 1 and GROUP 2.]



FIG. 5C
S =    0    1    0    1    0    1    0    1
       W0   W1   W2   W3   W4   W5   W6   W7
       [G0] [G1] [G2] [G3] [G4] [G5] [G6] [G7]



FIG. 6
[Instruction word format with bit positions 63, 59 and 56 marked and an INSTRUCTION field.]






FIG. 8
[Block diagram. Labels: SECONDARY CACHE; PRE-DECODER; TLB; PHYSICAL ADDRESS; VIRTUAL ADDRESS BITS [31:13]; BITS [12:5]; 2-WAY SET ASSOCIATIVE PRIMARY INSTRUCTION CACHE; TAGS; ASSOCIATIVE CROSSBAR; MUX; PIPE 0, PIPE 1, ..., PIPE 7; ADDRESSES; 32 BITS.]







FIG. 9
[Labels: 120; 100; PIPE 0; PIPE 1.]






FIG. 12

INSTRUCTION  ALU  ALU  LOAD  STORE  LOAD  FP  FP  Br
GROUP TAG    0    0    0     0      1     1   1   2



CLOCK 1— ALU ALU LOAD STORE 

CLOCK 2— LOAD FP FP 

CLOCK 3— BRANCH 

FIG. 13 






FIG. 15
[Diagram of the group select function. Labels: NEXT FRAME; CURRENT FRAME; GROUP IDS (I7 to I0); INSTRUCTION ENABLES (I7 to I0) for CLK1 to CLK8; PC; NFF; DECODE; LATER GROUP?; INST. ENABLE; PROGRAM COUNTER (@ CLOCK 1); NEXT GROUP?; NEXT FRAME FLAG.]






